Abstract

Finding "densely connected clusters" in a graph is an important and well studied problem in the
literature [12]. It has various applications in pattern recognition, social networking and data mining [11], [14].
Recently, Ames and Vavasis have suggested a novel method for finding cliques in a graph by using convex
optimization over the adjacency matrix of the graph [2], [3]. Also, there have been recent advances in decomposing
a given matrix into its "low rank" and "sparse" components [4], [5]. In this paper, inspired by these results, we
view "densely connected clusters" as imperfect cliques, where imperfections correspond to missing edges, which
are relatively sparse. We analyze the problem in a probabilistic setting and aim to detect disjointly planted
clusters. Our main result suggests that one can find dense clusters in a graph, as long as the clusters
are sufficiently large. We conclude by discussing possible extensions and future research directions.

1 Introduction

Recently, convex optimization methods have become increasingly popular for data analysis. For example, in
compressed sensing [1], we observe the measurements and aim to recover an unknown sparse solution of a system
of linear equations via $\ell_1$ minimization. In many other cases, we have perfect knowledge of a signal that
may look complicated but has a simpler underlying structure, and we aim to reveal this structure by
decomposing it into meaningful pieces. For example, decomposing a signal into a sparse superposition of sines
and spikes is one of the well-known problems of this type [13]. Decomposing a matrix into low rank and sparse
components is another key problem of this nature and it has recently been studied in various settings [4], [5].
In this problem, we observe the matrix $L^0 + S^0$, where $L^0$ is low rank and $S^0$ is sparse, and we aim to find $L^0$ and
$S^0$. The suggested convex optimization program is as follows:

$$\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \tag{1}$$
$$\text{subject to}\quad L + S = L^0 + S^0$$

Here $\|\cdot\|_*$ is the nuclear norm, i.e., the sum of the singular values of a matrix, and $\|\cdot\|_1$ is the $\ell_1$ norm, i.e., the sum
of the absolute values of the entries. Problem (1) can be considered as the natural convex relaxation of the "low rank
+ sparse" decomposition, as the $\ell_1$ norm and the nuclear norm are the tightest convex relaxations of the sparsity and rank
functions respectively. Consequently, this program promotes sparsity for $S$ and low rankness for $L$. For the correct
choice of $\lambda$, if $L^0$ and $S^0$ satisfy certain incoherence requirements, it is known that we'll have $(L^*, S^*) = (L^0, S^0)$,
where $(L^*, S^*)$ is the output of problem (1).

This result is actually very useful as low rankness and sparsity are the underlying structures in many problems. 
In jS], Gaussian graphical models with latent variables were investigated and the problem of finding conditional 
dependencies of the observed variables was connected to problem (1). On the other hand, in the problem of
finding cliques in a given unweighted graph, the key observation is that, in the adjacency matrix, a clique
corresponds to a submatrix of all 1's, which is clearly rank 1. Based on these observations, in this paper, we aim to
extend the results of Ames and Vavasis [3] for detection of the planted cliques to detection of "densely connected
clusters". This problem comes up naturally since, most of the time, it is unreasonable to expect full cliques in
a graph. For example, there might be missing edges naturally, or data might be corrupted, or we might be observing
only partial information. However, even if we miss some of the edges, it is very likely that most of the edges will
be preserved and the cluster we want to identify will still be denser than the rest. We'll view these dense clusters
as imperfect cliques with some missing edges and, in our approach, full cliques will correspond to the low rank piece,
$L^0$, whereas the missing edges inside and the (extra) edges outside of the clusters will correspond to the sparse piece, $S^0$.

This work was supported in part by the National Science Foundation under grants CCF-0729203, CNS-0932428 and CCF-1018927,
by the Office of Naval Research under the MURI grant N00014-08-1-0747, and by Caltech's Lee Center for Advanced Networking.

We analyze the problem under a general probabilistic setting, which we call the "probabilistic cluster model". In
our model, an edge inside the $i$'th cluster exists with probability $p_i$ and an edge which is not inside any of the clusters
exists with probability $q$, independent of other edges in the graph, where $1 \ge p_i > q \ge 0$ are constants. Here, by
"inside a cluster", we mean an edge lying between two nodes which belong to the same cluster. Notice that this
model can be viewed as a slight modification of the well-known Erdős-Rényi random graph model, where we introduce
a nonuniform distribution which makes the clusters identifiable. We additionally assume the clusters are disjoint.

We'll analyze two convex programs for detection of the clusters using the knowledge of the graph. We name
the first program the "blind approach"; it is a slight modification of problem (1), given in (10), and we show
that if

$$\min_i p_i = p_{\min} > \frac{1}{2} > q \tag{2}$$

then, as long as the clusters are sufficiently large, problem (10) can detect the clusters with high probability. Our
second program is called the "intelligent approach" and is given in problem (15). In this case, we require extra
information but we can guarantee the detection for any $p_{\min} > q$. Problem (15) can be considered as a mixture
of (1) and (12) of [2], because it focuses on the subgraph induced by the edges inside the clusters, similar to (12)
of [2], but additionally accounts for the missing edges. This approach also trivially extends to the case where we
observe a partial graph, in which each edge is observed with the same probability independent of the others. In this
case, clusters can still be recovered but we need the clusters to be slightly larger compared to the case where we observe the
full graph.



2 Basic Definitions and Notations 

Let $[m]$ denote the set $\{1, 2, \dots, m\}$ for all integers $m \ge 1$. We differentiate a subset of nodes in a graph by calling
that subset a cluster. For the rest of the paper, we assume the graph $\mathcal{G}$ is unweighted with $n$ nodes, and there are
$t$ disjoint planted clusters with sizes $\{k_i\}_{i=1}^t$. By unweighted we mean edges do not carry weights. Assume
nodes are labeled from 1 to $n$ and let $C_i$ be the set of the nodes inside the $i$'th cluster, hence $C_i \subseteq [n]$, $|C_i| = k_i$ and
$C_i \cap C_j = \emptyset$ for any $i \neq j$. We also let $C_{t+1}$ denote the rest of the nodes, i.e., $C_{t+1} = [n] - \bigcup_{i=1}^t C_i$ and $k_{t+1} = n - \sum_{i=1}^t k_i$.

We call a subset $\beta$ of $[c] \times [d]$ a region. $\beta^c$ denotes the complement, which is given by $\beta^c = [c] \times [d] - \beta$.

Let $\mathcal{R}$ be the region corresponding to the union of regions induced by the clusters, i.e., $\mathcal{R} = \bigcup_{i=1}^t C_i \times C_i$. Note
that $\mathcal{R}$ is simply a subset of $[n] \times [n]$. We also let $\mathcal{R}_{i,j} = C_i \times C_j$ for $1 \le i, j \le t+1$. $\{\mathcal{R}_{i,j}\}$ divides $[n] \times [n]$
into $(t+1)^2$ disjoint regions similar to a grid. Also, $\mathcal{R}_{i,i}$ is simply the region induced by the $i$'th cluster for any $i \le t$.

Let $a, b \in \mathbb{R}$ and $0 \le r \le 1$. We say a random variable $X$ is Bern$(a, b, r)$ if

$$\mathbb{P}(X = a) = r \tag{3}$$
$$\mathbb{P}(X = b) = 1 - r \tag{4}$$

For a given matrix $X$, $X_{i,j} = (X)_{i,j}$ denotes the entry lying on the $i$'th row and $j$'th column. $\mathbf{1}^{c \times d}$ is a $c \times d$ matrix
whose entries are all 1's. Assume $\beta$ is a subset of $[c] \times [d]$. Then $\beta$ can be viewed as a set of coordinates and, if
$X \in \mathbb{R}^{c \times d}$, we denote the matrix which is induced by the entries of $X$ on $\beta$ by $X_\beta$:

$$(X_\beta)_{i,j} = \begin{cases} X_{i,j} & \text{if } (i,j) \in \beta \\ 0 & \text{else} \end{cases} \tag{5}$$

In particular, $\mathbf{1}_\beta^{c \times d}$ is a matrix whose entries on $\beta$ are 1 and whose remaining entries are 0.
Now, we introduce some definitions to explain the model we'll work on.

Definition 1 (Random Support). A random set $\beta \subseteq [c] \times [d]$ is called a "random support" with parameter $0 \le r \le 1$
if each coordinate $(i,j) \in [c] \times [d]$ is an element of $\beta$ with probability $r$, independent of other coordinates.

A random set $\Gamma \subseteq [c] \times [d]$ is called a "corrected random support" with parameter $r$ if it is statistically identical to
$\beta \cup \{(i,i) : 1 \le i \le \min\{c,d\}\}$, where $\beta$ is a random support with parameter $r$. Basically, we include the diagonal coordinates.

Let $A$ be the adjacency matrix of $\mathcal{G}$. For simplicity, we let $A_{i,i} = 1$ for all $i \in [n]$. Also, for $(i,j) \in [n] \times [n]$, $i \neq j$:

$$A_{i,j} = \begin{cases} 1 & \text{if an edge exists between nodes } i, j \\ 0 & \text{else} \end{cases} \tag{6}$$

Note that $A$ is symmetric, i.e., $A_{i,j} = A_{j,i}$ for all $i, j \in [n]$; as a result, it is uniquely determined by the entries on
the lower triangular part.

Definition 2 (Probabilistic Cluster Model). Recall that $C_i \subseteq [n]$ with $C_i \cap C_j = \emptyset$ and $|C_i| = k_i$ for all $1 \le i \neq j \le t$.
Also, $\mathcal{R}_{i,j} = C_i \times C_j$ for all $1 \le i, j \le t+1$ and $\mathcal{R} = \bigcup_{i=1}^t \mathcal{R}_{i,i}$. Let $\{p_i\}_{i=1}^t, q$ be constants between 0 and 1. Then,
a random graph $\mathcal{G}$, generated according to the probabilistic cluster model, has the following adjacency matrix. Entries
of $A$ on the lower triangular part are independent random variables and for any $i > j$:

$$A_{i,j} = \begin{cases} \text{a Bern}(1, 0, p_l) \text{ random variable} & \text{if } (i,j) \in \mathcal{R}_{l,l} \text{ for some } l \le t \\ \text{a Bern}(1, 0, q) \text{ random variable} & \text{else} \end{cases} \tag{7}$$

Verbally, an edge inside the $l$'th cluster exists with probability $p_l$ and an edge which is not inside any of the clusters
exists with probability $q$, independent of other edges, where $1 \ge \{p_i\}_{i=1}^t, q \ge 0$. In order to distinguish the clusters,
we'll assume they are denser, i.e., an edge inside the region $\mathcal{R}$ is more likely to exist compared to an edge which is
not. Consequently, we have:

$$\min_{i \le t} p_i = p_{\min} > q \tag{8}$$

for the rest of the paper. One can similarly treat the case where $\max_{i \le t} p_i = p_{\max} < q$ by considering the
complement graph $\mathcal{H}$, whose adjacency matrix $B$ satisfies $B_{i,j} = 1 - A_{i,j}$ for all $i \neq j$. In this case, $\mathcal{H}$ will still satisfy the probabilistic
cluster model with inside and outside cluster edge probabilities $\{1 - p_i\}_{i=1}^t$ and $1 - q$ respectively, where $\min_{i \le t} (1 - p_i) > 1 - q$.
Notice that, in the special case of cliques, we have $p_i = 1$ for all $i \le t$.
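To make the probabilistic cluster model concrete, the following is a minimal sketch that samples an adjacency matrix according to Definition 2 and forms the complement graph. The function names, the 0-based node labels and the use of Python's `random` module are illustrative assumptions, not part of the paper.

```python
import random

def sample_cluster_graph(n, clusters, p, q, seed=0):
    """Sample a symmetric 0/1 adjacency matrix with unit diagonal.

    clusters: list of disjoint lists of node indices (0-based, an assumption).
    p: within-cluster edge probabilities, one per cluster.
    q: edge probability for every pair not inside a common cluster.
    """
    rng = random.Random(seed)
    label = {}
    for idx, nodes in enumerate(clusters):
        for v in nodes:
            label[v] = idx
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1  # convention A_{i,i} = 1
        for j in range(i):  # lower triangular part, i > j
            same = i in label and label.get(j) == label[i]
            prob = p[label[i]] if same else q
            A[i][j] = A[j][i] = 1 if rng.random() < prob else 0
    return A

def complement(A):
    """Adjacency matrix with B_{i,j} = 1 - A_{i,j} off the diagonal."""
    n = len(A)
    return [[1 if i == j else 1 - A[i][j] for j in range(n)] for i in range(n)]
```

With `p` set to all ones and `q = 0`, the sample reduces to planted cliques with no edges elsewhere, which matches the special case mentioned above.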
In this model, A can be characterized also by using random supports. 

t 

i=l 

where $\{\beta_i\}_{i=1}^t, \Gamma$ are independent corrected random supports with parameters $\{p_i\}_{i=1}^t, q$ respectively.

Let $\mathcal{A} \subseteq [n] \times [n]$ be the set of nonzero coordinates of $A$, i.e., $\mathbf{1}_{\mathcal{A}}^{n \times n} = A$. Basically, $\mathcal{A}$ is the region induced by
the edges inside the graph $\mathcal{G}$ with the addition of the diagonal coordinates. For example, the set $\mathcal{A}^c \cap \mathcal{R}$ corresponds
to the missing edges inside the clusters. Clearly, $\mathcal{A}$ is random, as $\mathcal{G}$ is drawn from the probabilistic cluster model.

We'll call a matrix (or vector) positive (negative) if all its entries are positive (negative). Finally, we let sum$(X)$
denote the sum of the entries of $X$, i.e., $\mathrm{sum}(X) = \sum_{i=1}^c \sum_{j=1}^d X_{i,j}$ for $X \in \mathbb{R}^{c \times d}$. If the matrix $X$ is nonnegative then
$\mathrm{sum}(X) = \|X\|_1$.



3 Proposed Convex Programs 

Our aim is to find the clusters $\{C_i\}_{i=1}^t$ in a graph $\mathcal{G}$ drawn from the probabilistic cluster model described in
Definition 2. This can be achieved by finding $\mathcal{R}$. This is not hard to see because, in the matrix $\mathbf{1}_{\mathcal{R}}^{n \times n}$, the nonzero
entries of each column correspond exactly to one of the clusters, as the clusters are disjoint. Then, we can simply
scan through all columns to find the clusters.
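The column scan described above can be sketched as follows. The helper name and the list-of-lists matrix representation are assumptions made for illustration; the input is taken to be the 0/1 indicator matrix of the region induced by disjoint clusters.

```python
def clusters_from_indicator(R):
    """Recover disjoint clusters from a 0/1 indicator matrix given as a list of rows."""
    n = len(R)
    seen = set()
    clusters = []
    for j in range(n):
        # The nonzero rows of column j form the cluster containing node j (if any).
        support = tuple(i for i in range(n) if R[i][j] == 1)
        if support and support not in seen:
            seen.add(support)
            clusters.append(list(support))
    return clusters
```

Columns that belong to nodes outside every cluster are all zero and are skipped, so only the planted clusters are returned.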

3.1 Blind Approach 

As our first approach, in order to find $\mathcal{R}$, we suggest the following, slightly modified version of problem (1):

$$\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \tag{10}$$
subject to
$$1 \ge L_{i,j} \ge 0 \text{ for all } i,j \tag{11}$$
$$L + S = A \tag{12}$$

The advantage of this approach is that we don't need any additional information about the clusters, such as the
number (or sizes) of the clusters. The desired solution is $(L^0, S^0)$, where $L^0$ corresponds to the full cliques, when
missing edges inside $\mathcal{R}$ are completed, and $S^0$ corresponds to the missing edges and the extra edges between the
clusters. In particular we want:

$$L^0 = \mathbf{1}_{\mathcal{R}}^{n \times n} \tag{13}$$
$$S^0 = A - L^0 = \mathbf{1}_{\mathcal{A} \cap \mathcal{R}^c}^{n \times n} - \mathbf{1}_{\mathcal{A}^c \cap \mathcal{R}}^{n \times n} \tag{14}$$

It is easy to see that the $(L^0, S^0)$ pair is feasible; later we'll argue that, under the correct assumptions, $(L^0, S^0)$ is indeed
the unique optimal solution.
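As a small illustration of the desired solution, the sketch below builds the candidate pair for the blind approach from a known set of clusters and evaluates the objective at it. The helper names are hypothetical. The nuclear norm is computed as the sum of the cluster sizes, which uses the fact that the indicator matrix of disjoint clusters has singular values equal to the cluster sizes.

```python
def candidate_pair(A, clusters):
    """Return (L0, S0) with L0 the cluster indicator matrix and S0 = A - L0."""
    n = len(A)
    L0 = [[0] * n for _ in range(n)]
    for nodes in clusters:
        for i in nodes:
            for j in nodes:
                L0[i][j] = 1
    S0 = [[A[i][j] - L0[i][j] for j in range(n)] for i in range(n)]
    return L0, S0

def blind_objective(A, clusters, lam):
    """Objective value ||L0||_* + lam * ||S0||_1 at the candidate pair."""
    _, S0 = candidate_pair(A, clusters)
    nuclear = sum(len(nodes) for nodes in clusters)  # singular values are the k_i
    l1 = sum(abs(x) for row in S0 for x in row)
    return nuclear + lam * l1
```

Entries of `S0` equal to -1 mark missing edges inside clusters, and entries equal to +1 mark edges (and diagonal entries) outside the cluster region.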



3.2 Intelligent Approach 

The second convex problem to be analyzed is a mixture of problem (1) and (12) of [2]. We'll require extra
information, namely the size of the region induced by the clusters, i.e., $|\mathcal{R}|$. The suggested program focuses on the subgraph
induced by the edges inside the clusters and is given below:

$$\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \tag{15}$$
subject to
$$1 \ge L_{i,j} \ge S_{i,j} \ge 0 \text{ for all } i,j \tag{16}$$
$$\mathrm{trace}\left((\mathbf{1}^{n \times n} - A)^T (L - S)\right) = 0 \tag{17}$$
$$\mathrm{sum}(L) \ge |\mathcal{R}| = \sum_{i=1}^t k_i^2 \tag{18}$$

Knowledge of $|\mathcal{R}|$ will help us guess the solution of problem (15) (under the right assumptions). $L^0$ should
correspond to the full cliques, similar to (13); however, $S^0$ should only correspond to the missing edges inside the
clusters. Formally, we want:

$$L^0 = \mathbf{1}_{\mathcal{R}}^{n \times n} \tag{19}$$
$$S^0 = \mathbf{1}_{\mathcal{A}^c \cap \mathcal{R}}^{n \times n} \tag{20}$$

In the next section, we'll state the main results of the paper regarding problems (10) and (15). Proofs of the
theorems in Section 4 will be given in Section 6 and the sections that follow it. Finally, Section 5 will conclude the paper.



4 Main Results 

In this section, we'll explain the conditions under which the candidates given in (13) and (19) are the unique optimal
solutions of problems (10) and (15) respectively. This will also naturally answer the question of finding the densely
connected clusters $\{C_i\}_{i=1}^t$. Let $k_{\min}$ be the size of the smallest cluster,

$$k_{\min} = \min_{1 \le i \le t} k_i \tag{21}$$

and recall that $p_{\min}$ was given in (8). Our analysis yields the following fundamental constraints.

• $\lambda \sqrt{n} \le C$ for some constant $C$ (in particular, $\lambda$ of order $\frac{1}{\sqrt{n}}$ will work).

• $k_i \ge \frac{1}{\lambda (p_i - q)}$ for all $i \le t$.

Actually, both of these constraints are natural. In [4], a weight proportional to $\frac{1}{\sqrt{n}}$ is used for problem (1). It is not
surprising that we are using a similar weight, as our random graph model has strong similarities with the uniformly
random support of the sparse component in [3]. Secondly, we observe that $\lambda \le \frac{C}{\sqrt{n}}$ implies $k_i \ge \frac{\sqrt{n}}{C(p_i - q)}$, which
suggests that, for recoverability, we need the size of the $i$'th cluster to be at least $\Omega(\sqrt{n})$ and, as $p_i - q$ gets smaller, this
size should grow. This condition is consistent with the previous results of [2], [3], which say that for recoverability of $t$
disjoint cliques, one needs a minimum clique size of $\Omega(\sqrt{n})$.

The main results of this paper are summarized in the following theorems.

Theorem 1 (Main Result for Intelligent Approach). Set $\lambda = \frac{1}{2\sqrt{n}}$. Let $\mathcal{G}$ be a random graph generated according
to the probabilistic cluster model with cluster sizes $\{k_i\}_{i=1}^t$ and parameters $\{p_i\}, q$. Assume $p_{\min} > q$ and
$k_i \ge \frac{2}{\lambda (p_i - q)} = \frac{4\sqrt{n}}{p_i - q}$ for all $i \le t$. Then, (independent of the rest of the parameters) there exist constants $c, C > 0$
such that, as a result of convex program (15), we have

$$L^* = \mathbf{1}_{\mathcal{R}}^{n \times n} = \sum_{i=1}^t \mathbf{1}_{\mathcal{R}_{i,i}}^{n \times n} \tag{22}$$
$$S^* = \mathbf{1}_{\mathcal{A}^c \cap \mathcal{R}}^{n \times n} \tag{23}$$

with probability at least (w.p.a.l.)

$$1 - c n^2 \exp\left(-C (p_{\min} - q)^2 k_{\min}\right) \tag{24}$$

In Theorem 1, one can simplify the condition on $\{k_i\}$ by simply requiring $k_{\min} \ge \frac{4\sqrt{n}}{p_{\min} - q}$; however, the statement of
the theorem will be weaker unless all $\{p_i\}$'s are equal. The following corollary gives an idea about the case where we
observe the partial graph.

Corollary 1 (Result for Partially Observed Graphs). Let $\mathcal{G}$ be a random graph as described in Theorem 1 and suppose
we observe each edge of $\mathcal{G}$ with probability $r$, independent of the other edges. Let $A'$ be the adjacency matrix of the
observed subgraph. Then, the statement of Theorem 1 holds with variables $A', \{p_i'\}, q'$ instead of $A, \{p_i\}, q$ respectively,
where $q' = rq$ and $p_i' = r p_i$ for all $i \le t$. Hence for recovery (w.h.p.), we require $k_i \ge \frac{4\sqrt{n}}{r(p_i - q)}$ for all $i \le t$.
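As a rough numerical companion to Theorem 1 and Corollary 1, the sketch below evaluates the weight and the cluster-size requirement. It assumes the reading $\lambda = 1/(2\sqrt{n})$ and $k_i \ge 2/(\lambda(p_i - q))$ of the theorem; the function names are hypothetical, and the constants `c` and `C` in the probability bound are placeholders, since the theorem only asserts their existence.

```python
import math

def size_requirements(n, p, q, r=1.0):
    """Return (lam, k_required) for within-cluster probabilities p and background q.

    r is the probability of observing each edge (r = 1 means the full graph);
    it rescales p_i and q to r * p_i and r * q.
    """
    lam = 1.0 / (2.0 * math.sqrt(n))  # assumed reading of the weight
    k_required = [2.0 / (lam * (r * pi - r * q)) for pi in p]
    return lam, k_required

def failure_bound(n, p_min, q, k_min, c=1.0, C=1.0):
    """Shape of the bound c * n^2 * exp(-C (p_min - q)^2 k_min); c, C are placeholders."""
    return c * n ** 2 * math.exp(-C * (p_min - q) ** 2 * k_min)
```

Halving the observation probability `r` doubles the required cluster size, which is the sense in which partially observed graphs need slightly larger clusters.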

Theorem 2 (Main Result for Blind Approach). Let p m i n > h > q an d G be a random graph generated according to 
the probabilistic cluster model with cluster sizes {ki}\ =1 . Set A = -q= and assume fcj > 2 p V ?L f or a ^ * — ^- Then, 
there exists constants c, C > such that as the output of problem we have 

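$$L^* = \mathbf{1}_{\mathcal{R}}^{n \times n} \tag{25}$$
$$S^* = \mathbf{1}_{\mathcal{A} \cap \mathcal{R}^c}^{n \times n} - \mathbf{1}_{\mathcal{A}^c \cap \mathcal{R}}^{n \times n} \tag{26}$$

with probability at least

$$1 - c n^2 \exp\left(-C \left(\min\{2p_{\min} - 1,\ 1 - 2q\}\right)^2 k_{\min}\right) \tag{27}$$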

We should emphasize that slightly stronger results can be given for both theorems. For example, we can reduce 
the lower bound required for {ki} by a factor of four in both theorems at the expense of the error exponent C. In 
fact, one can get even better lower bounds for {ki} by choosing A as a function of {pi},q however we preferred to 
make A independent of {pi},q. 

The following theorem provides a converse result for blind method. 

Theorem 3. Let $\mathcal{G}$ be a random graph generated according to the probabilistic cluster model with $\{p_i\}, q$ and assume
$p_{\min} > q$. Let $\lambda = \frac{C}{\sqrt{n}}$ for some constant $C > 0$. Then, if

$$\frac{1}{2} > p_{\min} \quad \text{or} \quad q > \frac{1}{2} \text{ and } \mathcal{R} \neq [n] \times [n] \tag{28}$$

as $n \to \infty$, $(L^0, S^0)$ given in (13) is not a minimizer of problem (10) with probability approaching 1.
Remark: Note that if $\mathcal{R} = [n] \times [n]$ there is nothing to solve, as all nodes are in the same cluster.



5 Future Extensions and Conclusion 
5.1 Simulation Results 

We considered two relatively small cases. For the first case, we have $t = 2$, $n = 64$, $c_1 = c_2 = 28$, $q = 0.15$ and
$p_1 = p_2 = p$ is variable. We plotted the empirical probability of success for both methods as a function of $p$ in
Figure 1.

Secondly, in order to illustrate the difference between the intelligent and blind approaches, we set $t = 1$, $n = 50$,
$c_1 = 40$, $q = 0.10$ and varied $p_1 = p$. Due to Theorem 3, for the blind approach to work, we always need $p > 1/2$.
On the other hand, the intelligent approach will work for any $p > q$ as long as $k_{\min}$ is sufficiently large. Hence, when
we increase $k_{\min}$ we expect to see a better recovery region for the intelligent approach compared to the blind one. We should
remark that, in a probabilistic setting, the $t = 1$ case is trivial, as we can find the cluster with high probability by
looking at the nodes with high degree. The empirical curve is given in Figure 2.

Remark: In order to keep the model size n small, we used A = -4= in both of the simulations. 
Figure 1: The two methods perform close to each other. Also observe that the phase transition is sharp and for p > 0.8 both methods succeed w.h.p.

5.2 Future Extensions 
5.2.1 Alternative approaches 

Our simulation results indicate that a slight modification to problem (1) of [5] might be an alternative to the
methods analyzed in this paper. Let $e \in \mathbb{R}^n$ be the vector of all 1's. Then, assuming we know the number of
clusters $t$, the proposed convex program is as follows:

$$\max_{L}\ \mathrm{sum}(L_{\mathcal{A}}) \tag{29}$$
$$\text{subject to} \tag{30}$$
$$L \succeq 0 \text{ (positive semi-definite)} \tag{31}$$
$$\mathrm{trace}(L) = t \tag{32}$$
$$L_{i,j} \ge 0 \text{ for all } 1 \le i, j \le n \tag{33}$$
$$(Le)_i \le e_i = 1 \text{ for all } 1 \le i \le n \tag{34}$$

The desired solution of this problem is

$$L^0 = \sum_{i=1}^t \frac{1}{k_i} \mathbf{1}_{\mathcal{R}_{i,i}}^{n \times n} \tag{35}$$

It is easy to see that $L^0$ is feasible. Program (29) might be a more useful approach compared to (15), as it requires the
number of clusters $t$ as prior information instead of $|\mathcal{R}|$. However, we only considered "low rank + sparse"
decompositions in this paper.



5.2.2 Removing the disjointness assumption 

As a natural extension, we consider removing the assumption of disjoint clusters. When clusters are allowed to
intersect, intuitively $\mathbf{1}_{\mathcal{R}}^{n \times n}$ is no longer low rank. Although we don't provide a proof, we believe the rank of $\mathbf{1}_{\mathcal{R}}^{n \times n}$ is
equal to the number of distinct nonempty sets of type $\bigcap_{i \in S} C_i \times C_i$ where $S \subseteq [t]$. This suggests the rank of $\mathbf{1}_{\mathcal{R}}^{n \times n}$ can be
as high as $2^t - 1$, which grows exponentially with the number of clusters. This intuition is verified by simulation results.
Consequently, convex programs (15) and (10) might not be good candidates when clusters are allowed to intersect,
as we aim to find $\mathbf{1}_{\mathcal{R}}^{n \times n}$ as a solution in these approaches. As a result, an alternative approach which will naturally
result in a low rank solution is of significant interest. Another related problem is, when clusters can intersect, how
to obtain $\{C_i\}_{i=1}^t$ from the knowledge of $\mathcal{R}$, assuming we are able to find $\mathcal{R}$ as a result of the optimization. Certainly,
we may not always be able to uniquely decompose $\mathcal{R}$ into $\{C_i\}$, but in general the decomposition which yields the smallest
number of clusters might be of interest.

Figure 2: Intelligent method succeeds for p > 0.6 and blind succeeds for p > 0.8. As we let $k_{\min} \to \infty$, intelligent and blind methods will
succeed for p > q = 0.1 and p > 0.5 respectively.

5.2.3 Extremely sparse graph 

In many cases $\{p_i\}, q$ decay as the model size grows. For example, in order to have a connected graph with high
probability, the Erdős-Rényi model with edge probability $r$ requires only $r > \frac{\ln n}{n}$, i.e., an average node degree of $\ln(n)$.

Sparse graphs are very common and useful in social networks [16] and web graphs [17], hence it would be of interest
to extend the results of this paper to the setting where $\{p_i\}, q$ are not constant. We believe this can be done by using
concentration results specific to the spectral norm of sparse matrices.

5.3 Final Comments 

In this paper, we analyzed two novel approaches for detection of disjoint clusters in a general probabilistic model. 
Our results are consistent with the existing works in literature and significantly extend results of [5] , [3] . Simulation 
results suggest that even for a relatively small model, our methods yield the desired result with high probability. 

6 Proof of Theorem 1

The analyses of problems (15) and (10) are similar to a great extent. Therefore, many of the results of this section will
also be used in Section 7. In the following discussion, $\lambda$ and $p_{\min} - q$ are always assumed to be positive.

6.1 Perturbation Analysis for $(L^0, S^0)$

Let $(L^*, S^*)$ denote the optimal solution of problem (15). We'll follow a conventional proof strategy to show that,
under some conditions, for any feasible nonzero perturbation $(E^L, E^S)$ over $(L^0, S^0)$ given in (19) and (20), the objective
function strictly increases, i.e.,

$$\|L^0 + E^L\|_* + \lambda \|S^0 + E^S\|_1 > \|L^0\|_* + \lambda \|S^0\|_1 \tag{36}$$
Consequently, due to convexity we'll conclude $(L^*, S^*) = (L^0, S^0)$.

6.1.1 Observations

Lemma 1. For the optimal solution of problem (15) we have $S^* = L^*_{\mathcal{A}^c}$.
Proof. From (17) we have

$$\mathrm{sum}\left((L^* - S^*)_{\mathcal{A}^c}\right) = 0 \tag{37}$$
This follows from the fact that $\mathbf{1}^{n \times n} - A = \mathbf{1}_{\mathcal{A}^c}^{n \times n}$. Combining this with (16), we can conclude that

$$L^*_{\mathcal{A}^c} = S^*_{\mathcal{A}^c} \tag{38}$$

since $L^*_{i,j} \ge S^*_{i,j}$ for all $i, j \in [n]$.

Secondly, one can observe that if $(L, S)$ is feasible for problem (15), then $(L, S_{\mathcal{A}^c})$ is also feasible and gives a
lower (or equal) cost. This is because:

• The only constraint on the entries of $S$ over $\mathcal{A}$ is $L_{i,j} \ge S_{i,j} \ge 0$, and $S_{i,j} = 0$ will trivially satisfy this.

• $\|S\|_1 = \|S_{\mathcal{A}}\|_1 + \|S_{\mathcal{A}^c}\|_1 \ge \|S_{\mathcal{A}^c}\|_1$ with equality if and only if $S_{\mathcal{A}} = 0$. Therefore, the objective will not
increase by substituting $S$ by $S_{\mathcal{A}^c}$.

Hence for optimality, we require:

$$S^*_{\mathcal{A}} = 0 \tag{39}$$
Using (38) and (39), WLOG, $S^*$ takes the following simple form:

$$S^* = L^*_{\mathcal{A}^c} \tag{40}$$

A natural interpretation of (40) is that the only role of $S$ is filling the missing edges inside the clusters. Actually,
we can write a simpler and equivalent optimization, where we get rid of the variable $S$ but still get the same result
as problem (15), as follows:

$$\min_{L}\ \|L\|_* + \lambda \|L_{\mathcal{A}^c}\|_1 \tag{41}$$
subject to
$$1 \ge L_{i,j} \ge 0 \text{ for all } i,j$$
$$\mathrm{sum}(L) \ge |\mathcal{R}|$$

Finally, notice that $(L^0, S^0)$ satisfies (40) as expected.
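6.1.2 Optimality Conditions for $(L^0, S^0)$

Let $\langle \cdot, \cdot \rangle$ denote the usual inner product, i.e., $\langle X, Y \rangle = \mathrm{trace}(X^T Y) = \sum_{i,j} X_{i,j} Y_{i,j}$. Also let $\mathrm{sign}(\cdot) : \mathbb{R}^{n \times n} \to
\{-1, 0, 1\}^{n \times n}$ be such that

$$\mathrm{sign}(X)_{i,j} = \begin{cases} 1 & \text{if } X_{i,j} > 0 \\ 0 & \text{if } X_{i,j} = 0 \\ -1 & \text{if } X_{i,j} < 0 \end{cases} \tag{42}$$

We would like to show that any feasible nonzero perturbation $(E^L, E^S)$ over $(L^0, S^0)$ will strictly increase the objective.
Due to Lemma 1 we can assume

$$E^S = E^L_{\mathcal{A}^c} \tag{43}$$

as $(L^0, S^0)$ satisfies (40). In the following discussion, we analyze the increase in the objective due to the perturbation.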
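Increase due to $E^S$: Similar to [4], by using the subgradient of the $\ell_1$ norm we can write:

$$\|S^0 + E^S\|_1 \ge \|S^0\|_1 + \langle \mathrm{sign}(S^0) + Q, E^S \rangle \tag{44}$$

for all $\|Q\|_\infty \le 1$, $Q_{\mathcal{A}^c \cap \mathcal{R}} = 0$, as $S^0$ is nonzero over $\mathcal{A}^c \cap \mathcal{R}$. Here $\|\cdot\|_\infty$ is the infinity norm, i.e., $\|X\|_\infty =
\max_{1 \le i,j \le n} |X_{i,j}|$.

Note that $\mathrm{sign}(S^0) = S^0 = \mathbf{1}_{\mathcal{A}^c \cap \mathcal{R}}^{n \times n}$. Then, by choosing $Q = \mathbf{1}_{\mathcal{A}^c \cap \mathcal{R}^c}^{n \times n}$ and using (43) in (44), we find:

$$\|S^0 + E^S\|_1 \ge \|S^0\|_1 + \mathrm{sum}(E^L_{\mathcal{A}^c}) \tag{45}$$

Increase due to $E^L$: Let $u_l \in \mathbb{R}^n$ be the characteristic vector of $C_l$ with unit norm, i.e., for $1 \le i \le n$, the $i$'th entry
of $u_l$ is

$$u_{l,i} = \begin{cases} \frac{1}{\sqrt{k_l}} & \text{if } i \in C_l \\ 0 & \text{else} \end{cases} \tag{46}$$

Let $U = [u_1 \ \dots \ u_t] \in \mathbb{R}^{n \times t}$ and $\mathcal{M}_U = \{X \in \mathbb{R}^{n \times n} : XU = X^T U = 0\}$. Also, $\|\cdot\|$ denotes the spectral norm, i.e.,
the maximum singular value. Then, the following lemma characterizes the increase in the objective due to $E^L$.

Lemma 2. For any $E^L$ and $W$ with $\|W\| \le 1$, $W \in \mathcal{M}_U$, we have

$$\|L^0 + E^L\|_* \ge \|L^0\|_* + \sum_{i=1}^t \frac{1}{k_i} \mathrm{sum}(E^L_{\mathcal{R}_{i,i}}) + \langle E^L, W \rangle \tag{47}$$

Proof. The singular value decomposition of $L^0$ can be written as

$$L^0 = \sum_{i=1}^t k_i u_i u_i^T = U\, \mathrm{diag}(k_1, \dots, k_t)\, U^T \tag{48}$$

as a result, the columns of $U$, $\{u_i\}_{i=1}^t$, are the left and right singular vectors of $L^0$. Then, we have

$$\|L^0 + E^L\|_* \ge \|L^0\|_* + \langle E^L, W + UU^T \rangle \tag{49}$$

for any $W \in \mathcal{M}_U$ with $\|W\| \le 1$, which follows from the subgradient of the nuclear norm, similar to [3]. Finally,
observe that

$$\langle E^L, UU^T \rangle = \sum_{i=1}^t \langle E^L, u_i u_i^T \rangle = \sum_{i=1}^t \frac{1}{k_i} \mathrm{sum}(E^L_{\mathcal{R}_{i,i}}) \tag{50}$$

to conclude.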
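The identity used at the end of the proof, namely that the inner product of a perturbation with $UU^T$ equals the sum over clusters of the block sums divided by the cluster sizes, can be checked numerically on a small example. The sketch below is only an illustration under the assumption of 0-based node labels; the helper names are hypothetical.

```python
import math

def unit_indicators(n, clusters):
    """Unit-norm characteristic vectors, one per cluster, as lists of length n."""
    U = []
    for nodes in clusters:
        s = 1.0 / math.sqrt(len(nodes))
        members = set(nodes)
        U.append([s if v in members else 0.0 for v in range(n)])
    return U

def inner_with_UUt(E, U):
    """Inner product <E, U U^T> computed entry by entry."""
    n = len(E)
    return sum(E[i][j] * sum(u[i] * u[j] for u in U)
               for i in range(n) for j in range(n))

def blockwise_sum(E, clusters):
    """Sum over clusters of (1 / k_i) * sum of E over the block C_i x C_i."""
    return sum(sum(E[i][j] for i in nodes for j in nodes) / len(nodes)
               for nodes in clusters)
```

For any square matrix `E` and any family of disjoint clusters, the two functions should agree up to floating point error.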



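Overall increase: By combining (45) and Lemma 2, we have the following lower bound for the increase of the
objective:

$$\left(\|L^0 + E^L\|_* - \|L^0\|_*\right) + \lambda \left(\|S^0 + E^S\|_1 - \|S^0\|_1\right) \ge \sum_{i=1}^t \frac{1}{k_i} \mathrm{sum}(E^L_{\mathcal{R}_{i,i}}) + \lambda\, \mathrm{sum}(E^L_{\mathcal{A}^c}) + \langle E^L, W \rangle \tag{51}$$

for any $W \in \mathcal{M}_U$, $\|W\| \le 1$. Then, as long as the right hand side of (51) can be made strictly positive for all
feasible nonzero $E^L$ (by properly choosing $W$), $(L^0, S^0)$ is the unique optimal solution of problem (15). Let us call

$$f(E^L, W) = \sum_{i=1}^t \frac{1}{k_i} \mathrm{sum}(E^L_{\mathcal{R}_{i,i}}) + \lambda\, \mathrm{sum}(E^L_{\mathcal{A}^c}) + \langle E^L, W \rangle \tag{52}$$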

6.1.3 Main Cases 

The following lemma will help us separate the problem into two main cases.

Lemma 3. Given $E^L$, assume there exists $W_0 \in \mathcal{M}_U$ with $\|W_0\| < 1$ such that $f(E^L, W_0) \ge 0$. Then at least one
of the following holds:

• There exists $W^* \in \mathcal{M}_U$ with $\|W^*\| \le 1$ and $f(E^L, W^*) > 0$.

• For all $W \in \mathcal{M}_U$, $\langle E^L, W \rangle = 0$.

Proof. Let $c = 1 - \|W_0\|$. Assume $\langle E^L, W' \rangle \neq 0$ for some $W' \in \mathcal{M}_U$. Since $\langle E^L, W' \rangle$ is linear in $W'$, WLOG, let
$\langle E^L, W' \rangle > 0$, $\|W'\| = 1$. Then choose $W^* = W_0 + cW'$. Clearly, $\|W^*\| \le 1$, $W^* \in \mathcal{M}_U$ and

$$f(E^L, W^*) = f(E^L, W_0) + \langle E^L, cW' \rangle > f(E^L, W_0) \ge 0 \tag{53}$$

Notice that the condition $\langle E^L, W \rangle = 0$ for all $W \in \mathcal{M}_U$ is equivalent to $E^L \in \mathcal{M}_U^\perp$, which is the orthogonal complement of
$\mathcal{M}_U$ in $\mathbb{R}^{n \times n}$. $\mathcal{M}_U^\perp$ has the following simple characterization:

$$\mathcal{M}_U^\perp = \{X \in \mathbb{R}^{n \times n} : X = UM^T + NU^T \text{ for some } M, N \in \mathbb{R}^{n \times t}\} \tag{54}$$

In the following discussion, based on Lemma 3, as a first step, in Section 6.2 we'll show that, under certain
conditions, for all feasible nonzero $E^L \in \mathcal{M}_U^\perp$, with high probability (w.h.p.)

$$g(E^L) = \sum_{i=1}^t \frac{1}{k_i} \mathrm{sum}(E^L_{\mathcal{R}_{i,i}}) + \lambda\, \mathrm{sum}(E^L_{\mathcal{A}^c}) > 0 \tag{55}$$

Secondly, in Section 6.3 we'll argue that, under certain conditions, there exists a $W \in \mathcal{M}_U$ with $\|W\| < 1$ such
that w.h.p. $f(E^L, W) \ge 0$ for all feasible $E^L$. This $W$ is called the dual certificate. Finally, combining these two
arguments, we'll conclude that $(L^0, S^0)$ is the unique optimal solution w.h.p.



6.2 Solving for the $E^L \in \mathcal{M}_U^\perp$ case

In order to simplify the following discussion, we let

$$g_1(X) = \sum_{i=1}^t \frac{1}{k_i} \mathrm{sum}(X_{\mathcal{R}_{i,i}}) \tag{56}$$
$$g_2(X) = \mathrm{sum}(X_{\mathcal{A}^c})$$

so that $g(X) = g_1(X) + \lambda g_2(X)$ in (55). Also let $V = [v_1 \ \dots \ v_t]$ where $v_i = \sqrt{k_i}\, u_i$. Thus, $V$ is obtained
by normalizing the columns of $U$ to make its nonzero entries 1. Assume $E^L \in \mathcal{M}_U^\perp$. Then, we can write

$$E^L = VM^T + NV^T \tag{57}$$

Let $m_i, n_i$ denote the $i$'th columns of $M, N$ respectively. Notice that $\mathrm{sum}(L^0) = |\mathcal{R}|$, hence from (18)

$$\mathrm{sum}(E^L) \ge 0 \tag{58}$$

Similarly, from $L^0$ and (16) it follows that

$$E^L_{\mathcal{R}^c} \text{ is (entrywise) nonnegative} \tag{59}$$
$$E^L_{\mathcal{R}} \text{ is nonpositive}$$

Now, we list some simple observations regarding the structure of $E^L$. We can write

$$E^L = \sum_{i=1}^t (v_i m_i^T + n_i v_i^T) = \sum_{i=1}^{t+1} \sum_{j=1}^{t+1} E^L_{\mathcal{R}_{i,j}} \tag{60}$$

Notice that $E^L_{\mathcal{R}_{i,j}}$ is contributed by only two components, which are $v_i m_i^T$ and $n_j v_j^T$.

Let $\{a_{i,j}\}_{j=1}^{k_i}$ be an (arbitrary) indexing of the elements of $C_i$, i.e., $C_i = \{a_{i,1}, \dots, a_{i,k_i}\}$. For a vector $z \in \mathbb{R}^n$ let
$z^i \in \mathbb{R}^{k_i}$ denote the vector induced by the entries of $z$ in $C_i$. Basically, for any $1 \le j \le k_i$, $z^i_j = z_{a_{i,j}}$. Also, let
$E^{i,j} \in \mathbb{R}^{k_i \times k_j}$ be $E^L$ induced by the entries on $\mathcal{R}_{i,j}$. In other words,

$$E^{i,j}_{c,d} = E^L_{a_{i,c},\, a_{j,d}} \text{ for any } 1 \le c \le k_i,\ 1 \le d \le k_j \tag{61}$$

Basically, $E^{i,j}$ is the same as $E^L_{\mathcal{R}_{i,j}}$ when we get rid of the trivial zero rows and zero columns. Then

$$E^{i,j} = \mathbf{1}^{k_i} (m_i^j)^T + n_j^i (\mathbf{1}^{k_j})^T \tag{62}$$

Clearly, given $\{E^{i,j}\}_{1 \le i,j \le t+1}$, $E^L$ is uniquely determined. Now, assume we fix $\mathrm{sum}(E^{i,j})$ for all $i, j$ and we would
like to find the worst $E^L$ subject to these constraints. The variables in such an optimization are $m_i, n_i$. Basically we
are interested in

$$\min_{\{m_i, n_i\}}\ g(E^L) \tag{63}$$
$$\text{subject to} \tag{64}$$
$$\mathrm{sum}(E^{i,j}) = c_{i,j} \text{ for all } i,j \tag{65}$$
$$E^{i,j} \text{ is } \begin{cases} \text{nonnegative} & \text{if } i \neq j \\ \text{nonpositive} & \text{if } i = j \end{cases} \tag{66}$$

where $\{c_{i,j}\}$ are constants. Constraint (66) follows from (59). Essentially, based on (55), we would like to show
that with high probability, for any nonzero $E^L$ with $\sum_{i,j} c_{i,j} \ge 0$, we have $g(E^L) > 0$. Remark: For the special
case of $i = j = t+1$, notice that $E^{i,j} = 0$.

In (63), $g_1(E^L)$ is fixed and equal to $\sum_{i=1}^t \frac{1}{k_i} c_{i,i}$. Consequently, based on (55), we just need to do the
optimization with the objective $g_2(E^L) = \mathrm{sum}(E^L_{\mathcal{A}^c})$.

Let $\beta_{i,j} \subseteq [k_i] \times [k_j]$ be a set of coordinates defined as follows. For any $(c, d) \in [k_i] \times [k_j]$

$$(c, d) \in \beta_{i,j} \text{ iff } (a_{i,c}, a_{j,d}) \in \mathcal{A} \tag{67}$$

For $(i_1, j_1) \neq (i_2, j_2)$, $(m_{i_1}^{j_1}, n_{j_1}^{i_1})$ and $(m_{i_2}^{j_2}, n_{j_2}^{i_2})$ are independent variables. Consequently, due to (62), we can
partition problem (63) into the following smaller disjoint problems.

$$\min_{m_i^j,\, n_j^i}\ \mathrm{sum}(E^{i,j}_{\beta_{i,j}^c}) \tag{68}$$
$$\text{subject to} \tag{69}$$
$$\mathrm{sum}(E^{i,j}) = c_{i,j} \tag{70}$$
$$E^{i,j} \text{ is } \begin{cases} \text{nonnegative} & \text{if } i \neq j \\ \text{nonpositive} & \text{if } i = j \end{cases} \tag{71}$$

Then, we can solve these problems locally (for each $i, j$) to finally obtain

$$g_2(E^{L,*}) = \sum_{i,j} \mathrm{sum}(E^{i,j,*}_{\beta_{i,j}^c}) \tag{72}$$

to find the overall result of problem (63), where $*$ denotes the optimal solutions in problems (63) and (68). The
following lemma will be useful for the analysis of these local optimizations.

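Lemma 4. Let $a \in \mathbb{R}^c$, $b \in \mathbb{R}^d$ and $X = \mathbf{1}^c b^T + a (\mathbf{1}^d)^T$ be variables and $C_0 \ge 0$ be a constant. Also let $\beta \subseteq [c] \times [d]$.
Consider the following optimization problem

$$\min_{a, b}\ \mathrm{sum}(X_\beta) \tag{73}$$
$$\text{subject to} \tag{74}$$
$$X_{i,j} \ge 0 \text{ for all } i,j \tag{75}$$
$$\mathrm{sum}(X) = C_0 \tag{76}$$

For this problem there exists an (entrywise) nonnegative minimizer $(a^0, b^0)$.

Proof. Let $x_i$ denote the $i$'th entry of a vector $x$. Assume $(a^*, b^*)$ is a minimizer. WLOG assume $b^*_1 = \min_{i,j} \{a^*_i, b^*_j\}$.
If $b^*_1 \ge 0$ we are done. Otherwise, since $X_{i,1} \ge 0$, we have $a^*_i \ge -b^*_1$ for all $i \le c$. Then set $a^0 = a^* + \mathbf{1}^c b^*_1$ and
$b^0 = b^* - \mathbf{1}^d b^*_1$. Clearly, $(a^0, b^0)$ is nonnegative. On the other hand, we have:

$$X^* = \mathbf{1}^c (b^*)^T + a^* (\mathbf{1}^d)^T = \mathbf{1}^c (b^0)^T + a^0 (\mathbf{1}^d)^T = X^0 \implies \mathrm{sum}(X^*_\beta) = \mathrm{sum}(X^0_\beta) = \text{minimum value} \tag{77}$$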
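The shifting argument in the proof of Lemma 4 can be sketched in a few lines: moving the most negative entry from one vector to the other leaves the matrix unchanged while making both vectors nonnegative. The function names below are hypothetical, and the sketch handles the smallest entry lying in either vector instead of assuming it is the first entry of $b$.

```python
def rank_two_matrix(a, b):
    """X[i][j] = b[j] + a[i], i.e. X = 1 b^T + a 1^T as a list of rows."""
    return [[bj + ai for bj in b] for ai in a]

def shift_to_nonnegative(a, b):
    """Return (a0, b0), both nonnegative, giving the same matrix.

    Assumes every entry of the matrix built from (a, b) is nonnegative.
    """
    m = min(min(a), min(b))
    if m >= 0:
        return list(a), list(b)
    if min(b) == m:
        # Smallest entry sits in b: add it to a and subtract it from b.
        return [ai + m for ai in a], [bj - m for bj in b]
    # Smallest entry sits in a: the symmetric shift.
    return [ai - m for ai in a], [bj + m for bj in b]
```

Since the matrix itself is unchanged, the objective and both constraints of the optimization take the same values before and after the shift.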



Lemma 5. A direct consequence of Lemma^4\ is the fact that in the local optimizations \68\). WLOG we can 
assume (rrr],n*-) entrywise nonnegative whenever i ^ j and entrywise nonpositive when i = j. This follows from 
the structure of E 1 ^ given in h62\) and \59\l . 
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Following lemma will help us characterize the relationship between sum(E l,J ) and smn(Egi ). 

Lemma 6. Let (3 £ M. cxd be a random support with parameter < r < 1. Then for any e > w.p.a.l. 1 — 
dexp(— 2e 2 c) for all nonzero and entrywise nonnegative a £ M. d we'll have: 

sum{X p ) > (r - e)sum(X) (78) 

where X = l c a T . Similarly, with same probability, for all such a, we'll have sum(Xp) < (r + e)sum(X) 

Proof. We'll only prove the first statement (78), as the proofs are identical. For each $i \le d$, $a_i$ occurs exactly $c$ times in $X$, since the $i$'th column of $X$ is $\mathbf{1}_c a_i$. Let $C_i$ denote the number of coordinates of the $i$'th column that are elements of $\beta$. We can view $C_i$ as a sum of $c$ i.i.d. Bern$(1, 0, r)$ random variables, so a Chernoff bound gives

$$P(C_i \le c(r - \epsilon)) \le \exp(-2\epsilon^2 c) \tag{79}$$

Now, we can use a union bound over all columns to make sure that $C_i > c(r - \epsilon)$ for all $i$:

$$P(C_i > c(r - \epsilon) \text{ for all } i \le d) \ge 1 - d\exp(-2\epsilon^2 c) \tag{80}$$

On the other hand, if each $C_i > c(r - \epsilon)$, then for any nonzero nonnegative $\mathbf{a}$

$$\mathrm{sum}(X_\beta) = \sum_{(i,j)\in\beta} X_{i,j} = \sum_{i=1}^{d} C_i a_i > c(r - \epsilon)\sum_{i=1}^{d} a_i = (r - \epsilon)\,\mathrm{sum}(X) \tag{81}$$
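The counting identity behind (81) and the union bound in (80) can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function names are hypothetical, the support is drawn with a fixed seed, and columns are indexed by `j` in the code.

```python
import math
import random


def masked_sum(a, mask):
    """sum(X_beta) for X = 1_c a^T, where mask[i][j] is True iff (i, j) is in beta."""
    return sum(a[j] for row in mask for j, inside in enumerate(row) if inside)


def column_counts(mask):
    """C_j = number of rows i with (i, j) in beta."""
    return [sum(1 for row in mask if row[j]) for j in range(len(mask[0]))]


def lemma6_failure_bound(c, d, eps):
    """Union bound d * exp(-2 eps^2 c) on the chance that some column count is too small."""
    return d * math.exp(-2.0 * eps * eps * c)


rng = random.Random(0)  # fixed seed so the example is reproducible
c, d, r = 200, 5, 0.3
mask = [[rng.random() < r for _ in range(d)] for _ in range(c)]
a = [1, 0, 2, 4, 3]  # nonnegative integers keep the arithmetic exact
counts = column_counts(mask)
```

The masked sum always equals the weighted column counts, which is the only fact about $\beta$ that the proof uses. The bound returned by `lemma6_failure_bound` decays exponentially in the number of rows $c$.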



Using Lemma 6 we can calculate a lower bound for $g(E^L)$ with high probability as long as the cluster sizes are sufficiently large. Due to (60) and the linearity of $g(E^L)$, we can focus on the contributions due to specific clusters, i.e. $\mathbf{v}_l\mathbf{m}_l^T + \mathbf{n}_l\mathbf{v}_l^T$ for the $l$'th cluster. We additionally know the simple structure of $\mathbf{m}_l$, $\mathbf{n}_l$ from Lemma 5. In particular, the subvectors $\mathbf{m}_l^l$ and $\mathbf{n}_l^l$ of $\mathbf{m}_l$, $\mathbf{n}_l$ can be assumed to be nonpositive and the rest of the entries nonnegative.

Now, we define an important parameter which will be useful for the subsequent analysis. This parameter can be seen as a measure of the distinctness of the "worst" cluster from the "background noise". Here background noise corresponds to the edges over $\mathcal{R}^c$.

$$\epsilon = \frac{1}{2}\min_{i \le t}\Big(p_i - q - \frac{1}{\lambda k_i}\Big) \tag{82}$$
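To get a feel for when this parameter is positive, it can be evaluated directly. The sketch below assumes the reading $\epsilon = \frac{1}{2}\min_{i\le t}(p_i - q - \frac{1}{\lambda k_i})$, which matches constraint (100); the function name `epsilon_82` and the numbers are illustrative only.

```python
def epsilon_82(p, q, k, lam):
    """epsilon = 0.5 * min_i (p_i - q - 1 / (lam * k_i)), following the reading of (82).

    p: within-cluster edge probabilities, q: background edge probability,
    k: cluster sizes, lam: the regularization weight lambda.
    """
    return 0.5 * min(pi - q - 1.0 / (lam * ki) for pi, ki in zip(p, k))


# Two clusters: the smaller, sparser one is the "worst" cluster here.
eps_large = epsilon_82(p=[0.9, 0.8], q=0.2, k=[100, 50], lam=0.05)

# With tiny clusters the term 1 / (lam * k_i) dominates and epsilon is negative.
eps_small = epsilon_82(p=[0.9, 0.8], q=0.2, k=[10, 10], lam=0.05)
```

For a fixed $\lambda$, the parameter becomes positive only once every $k_i$ exceeds $1/(\lambda(p_i - q))$, which is the sense in which clusters must be sufficiently large.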

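The following lemma gives a lower bound on $g(\mathbf{v}_l\mathbf{m}_l^T)$.

Lemma 7. Assume $\epsilon > 0$. Then, w.p.a.l. $1 - n\exp(-2\epsilon^2(k_l - 1))$, we have $g(\mathbf{v}_l\mathbf{m}_l^T) \ge \lambda(1 - q - \epsilon)\,\mathrm{sum}(\mathbf{v}_l\mathbf{m}_l^T)$ for all $\mathbf{m}_l$. Also, if $\mathbf{m}_l \neq 0$ then the inequality is strict.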

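Proof. Let us call $X^i = \mathbf{1}_{k_l}\mathbf{m}_l^{i\,T}$. Also, $\mathbf{m}_l^i$ is nonnegative for $i \neq l$ and nonpositive for $i = l$. Then

$$g(\mathbf{v}_l\mathbf{m}_l^T) = \frac{1}{k_l}\mathrm{sum}\big((\mathbf{v}_l\mathbf{m}_l^T)_{\mathcal{R}_{l,l}}\big) + \lambda\,\mathrm{sum}\big((\mathbf{v}_l\mathbf{m}_l^T)_{\mathcal{A}^c}\big) \tag{83}$$

$$= \frac{1}{k_l}\mathrm{sum}\big(\mathbf{1}_{k_l}\mathbf{m}_l^{l\,T}\big) + \sum_{i=1}^{t+1}\lambda\,\mathrm{sum}\big((\mathbf{1}_{k_l}\mathbf{m}_l^{i\,T})_{\beta_{l,i}^c}\big) \tag{84}$$

$$= \frac{1}{k_l}\mathrm{sum}(X^l) + \sum_{i=1}^{t+1}\lambda\,\mathrm{sum}\big(X^i_{\beta_{l,i}^c}\big) \tag{85}$$

$\beta_{l,i}$ is a random support with parameter $q$ if $i \neq l$ and a corrected random support with parameter $p_l$ if $i = l$. For a fixed $i \le t+1$, from Lemma 6, w.p.a.l. $1 - k_i\exp(-2\epsilon^2(k_l - 1))$ we have

$$\mathrm{sum}\big(X^i_{\beta_{l,i}^c}\big) \ge \begin{cases} (1 - q - \epsilon)\,\mathrm{sum}(X^i) & \text{if } i \neq l \\ (1 - p_l + \epsilon)\,\mathrm{sum}(X^l) & \text{if } i = l \end{cases} \tag{86}$$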






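Then, using a union bound, w.p.a.l. $1 - n\exp(-2\epsilon^2(k_l - 1))$ we have (86) for all $i$ and $\mathbf{m}_l$. Combining this with (85), we get

$$g(\mathbf{v}_l\mathbf{m}_l^T) \ge \lambda\sum_{i \neq l}(1 - q - \epsilon)\,\mathrm{sum}(X^i) + \Big(\frac{1}{k_l} + \lambda(1 - p_l + \epsilon)\Big)\mathrm{sum}(X^l) \tag{87}$$

$$\ge \lambda(1 - q - \epsilon)\sum_{i=1}^{t+1}\mathrm{sum}(X^i) = \lambda(1 - q - \epsilon)\,\mathrm{sum}(\mathbf{v}_l\mathbf{m}_l^T) \tag{88}$$

The second step uses $\mathrm{sum}(X^l) \le 0$ together with $\frac{1}{\lambda k_l} + 1 - p_l + \epsilon \le 1 - q - \epsilon$, which holds by the definition of $\epsilon$ in (82). If $\mathbf{m}_l \neq 0$, inequality (86) is strict for some $1 \le i \le t+1$ due to Lemma 6. Hence, (87) will be strict too. ■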

As mentioned previously, let $k_{\min}$ denote the size of the smallest cluster, which will be an important parameter for the rest of our analysis. The following theorem is based on Lemma 7 and gives the main result of this section.

Theorem 4. Let $\epsilon$ be the same as described in (82). Assume $\lambda$ and $\{k_i\}$ are such that $\epsilon > 0$. Then w.p.a.l. $1 - 2nt\exp(-2\epsilon^2(k_{\min} - 1))$, for any $E^L \neq 0$ with $E^L \in \mathcal{M}_U$ and $\mathrm{sum}(E^L) \ge 0$ we have $g(E^L) > 0$.

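Proof. Due to Lemma 7, for a particular $l$, w.p.a.l. $P_l = 1 - n\exp(-2\epsilon^2(k_l - 1))$ we have

$$g(\mathbf{v}_l\mathbf{m}_l^T) \ge \lambda(1 - q - \epsilon)\,\mathrm{sum}(\mathbf{v}_l\mathbf{m}_l^T) \tag{89}$$

and an identical result holds for the $\mathbf{n}_l\mathbf{v}_l^T$ term.

Now, union bounding over all $\{\mathbf{m}_l\}$, $\{\mathbf{n}_l\}$, we obtain that w.p.a.l.

$$1 - 2nt\exp(-2\epsilon^2(k_{\min} - 1)) \le 1 - 2\sum_{l=1}^{t}(1 - P_l) \tag{90}$$

(89) holds for all $l \le t$. Hence, going back to the decomposition of $E^L$,

$$g(E^L) = \sum_{l=1}^{t} g(\mathbf{v}_l\mathbf{m}_l^T + \mathbf{n}_l\mathbf{v}_l^T) \tag{91}$$

$$\ge \lambda(1 - q - \epsilon)\sum_{l=1}^{t}\big[\mathrm{sum}(\mathbf{v}_l\mathbf{m}_l^T) + \mathrm{sum}(\mathbf{n}_l\mathbf{v}_l^T)\big] \tag{92}$$

$$= \lambda(1 - q - \epsilon)\,\mathrm{sum}(E^L) \ge 0 \tag{93}$$

On the other hand, if $E^L \neq 0$ then at least one of $\{\mathbf{m}_l\}$, $\{\mathbf{n}_l\}$ is nonzero and inequality (92) is actually strict. ■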

Hence, the main result of this section is the fact that, as long as $\lambda$ and the cluster sizes $\{k_i\}$ are sufficiently large, we don't need to worry about feasible perturbations of type $E^L \in \mathcal{M}_U$.

6.3 Showing existence of the dual certificate 

In this section, we'll treat the second case. Our aim is to show the existence of a $W \in \mathcal{M}_U^\perp$ with $\|W\| < 1$ such that $f(E^L, W) \ge 0$ for all feasible $E^L$. We follow an approach consisting of three steps:

• Construct a candidate $W_0$ which satisfies $f(E^L, W_0) \ge 0$ for all feasible $E^L$.

• Show that $\|W_0\| < 1$ under certain conditions.

• Slightly modify $W_0$ to obtain $W$ which still satisfies the previous conditions, but also obeys $W \in \mathcal{M}_U^\perp$.

6.3.1 Candidate W 

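Recall that

$$f(E^L, W) = \sum_{i=1}^{t}\frac{1}{k_i}\mathrm{sum}(E^L_{\mathcal{R}_i}) + \lambda\,\mathrm{sum}(E^L_{\mathcal{A}^c}) + \langle E^L, W\rangle \tag{94}$$

Using approaches similar to [2] and [4], we'll construct a $W$ based on the following candidate

$$W_0 = c\,\mathbf{1}^{n\times n} + \lambda\,\mathbf{1}^{n\times n}_{\mathcal{A}} + \sum_{i=1}^{t} c_i\,\mathbf{1}^{n\times n}_{\mathcal{R}_i} \tag{95}$$

Here $c$, $\{c_i\}_{i=1}^t$ are the variables that will be used to construct the desired $W$. Now, let us see why this $W_0$ is an intelligent choice:

$$f(E^L, W_0) = \sum_{i=1}^{t}\Big(c_i + \frac{1}{k_i}\Big)\mathrm{sum}(E^L_{\mathcal{R}_i}) + (\lambda + c)\,\mathrm{sum}(E^L) \tag{96}$$

Notice that if $c_i + \frac{1}{k_i} \le 0$ and $\lambda + c \ge 0$ we are done, since $\mathrm{sum}(E^L) \ge 0$ and $\mathrm{sum}(E^L_{\mathcal{R}_i}) \le 0$ for all $i$. Obviously, one needs to do this while ensuring that $\|W_0\|$ is as small as possible.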

In (95), for constant $c$, $\{c_i\}$, $\lambda$, $W_0$ is a random matrix with i.i.d. entries due to the $\mathbf{1}^{n\times n}_{\mathcal{A}}$ term, as the graph is randomly generated. An intuitive way of ensuring a small $\|W_0\|$ is to force the expectation of $W_0$ to 0. In order to ensure that the expectation of the entries inside the region $\mathcal{R}^c$ is 0, we need

$$(\lambda + c)q + c(1 - q) = 0 \tag{97}$$

Hence $c = -\lambda q$. Now, setting the expectation over $\mathcal{R}_i$ to 0, we find

$$(c_i + \lambda + c)p_i + (c_i + c)(1 - p_i) = 0 \tag{98}$$

Hence

$$c_i = -c(1 - p_i) - (c + \lambda)p_i = \lambda q(1 - p_i) - \lambda(1 - q)p_i = \lambda(q - p_i) \tag{99}$$

Now notice that we satisfy $\lambda + c = \lambda(1 - q) \ge 0$. In order to satisfy $c_i + \frac{1}{k_i} \le 0$ we need

$$2\epsilon = \min_{i \le t}\Big(p_i - q - \frac{1}{\lambda k_i}\Big) > 0 \tag{100}$$
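The algebra leading to $c = -\lambda q$ and $c_i = \lambda(q - p_i)$ can be verified with exact rational arithmetic. This is a minimal sketch; the function names and the sample values of $\lambda$, $q$, $p_i$ are illustrative assumptions rather than values from the analysis.

```python
from fractions import Fraction


def candidate_coefficients(lam, q, p):
    """Return (c, [c_i]) with c = -lam*q and c_i = lam*(q - p_i), as in (97)-(99)."""
    c = -lam * q
    return c, [lam * (q - pi) for pi in p]


def mean_outside(lam, q, c):
    """Expected entry of W_0 on R^c: value lam + c on an edge (prob. q), c otherwise."""
    return (lam + c) * q + c * (1 - q)


def mean_inside(lam, pi, c, ci):
    """Expected entry of W_0 on R_i: ci + lam + c on an edge (prob. p_i), ci + c otherwise."""
    return (ci + lam + c) * pi + (ci + c) * (1 - pi)


lam, q = Fraction(1, 20), Fraction(1, 5)
p = [Fraction(9, 10), Fraction(3, 4)]
c, cs = candidate_coefficients(lam, q, p)
```

With these coefficients both expectations vanish exactly, and the condition $c_i + \frac{1}{k_i} \le 0$ becomes $k_i \ge \frac{1}{\lambda(p_i - q)}$, which is the content of (100).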



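The reader will remember that this is the same constraint we needed for Theorem 4. With these choices of $\{c_i\}$, $c$ we have

$$W_0 = \lambda\Big(-q\,\mathbf{1}^{n\times n} + \mathbf{1}^{n\times n}_{\mathcal{A}} + \sum_{i=1}^{t}(q - p_i)\,\mathbf{1}^{n\times n}_{\mathcal{R}_i}\Big) \tag{101}$$

$$= \lambda\Big(\sum_{i=1}^{t}\big[(1 - p_i)\,\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}_i} - p_i\,\mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}_i}\big] + \big[(1 - q)\,\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - q\,\mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}\big]\Big) \tag{102}$$

and

$$f(E^L, W_0) = \lambda(1 - q)\,\mathrm{sum}(E^L) - \sum_{i=1}^{t}\Big(\lambda(p_i - q) - \frac{1}{k_i}\Big)\mathrm{sum}(E^L_{\mathcal{R}_i}) \tag{103}$$

Assuming (100), since $\mathrm{sum}(E^L) \ge 0$ and $\mathrm{sum}(E^L_{\mathcal{R}_i}) \le 0$, for any feasible $E^L$ we have $f(E^L, W_0) \ge 0$; thus $W_0$ is indeed a good choice. However, there are two problems to be solved.

• Making sure that $\|W_0\|$ is sufficiently small.

• "Correcting" $W_0$ so that $W \in \mathcal{M}_U^\perp$ while still ensuring $f(E^L, W) \ge 0$ for all $E^L$.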



6.3.2 Bounding the spectral norm 

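The following lemma addresses the first problem and gives a simple bound on $\|W_0\|$.

Lemma 8. Recall that $W_0$ is a random matrix where the randomness is on $\mathcal{A}$, and $W_0$ is given by

$$W_0 = \lambda\Big(\sum_{i=1}^{t}\big[(1 - p_i)\,\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}_i} - p_i\,\mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}_i}\big] + \big[(1 - q)\,\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - q\,\mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}\big]\Big) \tag{104}$$

Then, for any $\epsilon > 0$, w.p.a.l. $1 - 4\exp(-\epsilon^2\frac{n}{32})$ we have

$$\|W_0\| \le (1 + \epsilon + o(1))\lambda\sqrt{n} \tag{105}$$

Proof. $\frac{1}{\lambda}W_0$ is a random matrix whose entries are i.i.d. and distributed as Bern$(-p_i, 1 - p_i, 1 - p_i)$ on $\mathcal{R}_i$, and Bern$(-q, 1 - q, 1 - q)$ on $\mathcal{R}^c$. Then the variance of an entry is at most $\max\{\{p_i(1 - p_i)\}_{i=1}^t, q(1 - q)\} \le 1/4$, hence we can use Theorem 1.5 of [15] to find

$$\mathrm{median}\big(\|\tfrac{1}{\lambda}W_0\|\big) \le \Big(2\sqrt{\max\{\{p_i(1 - p_i)\}, q(1 - q)\}} + o(1)\Big)\sqrt{n} \le (1 + o(1))\sqrt{n} \tag{106}$$

On the other hand, since the absolute values of the entries of $\frac{1}{\lambda}W_0$ are bounded by 1, Theorem 1 of [5] gives

$$4\exp\Big(-\epsilon^2\frac{n}{32}\Big) \ge P\Big[\|\tfrac{1}{\lambda}W_0\| \ge \mathrm{median}\big(\|\tfrac{1}{\lambda}W_0\|\big) + \epsilon\sqrt{n}\Big] \ge P\big[\|W_0\| \ge \lambda(1 + \epsilon + o(1))\sqrt{n}\big] \tag{107}$$

■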



Lemma 8 verifies that, asymptotically and with high probability, we can make $\|W_0\| < 1$ as long as we choose a proper $\lambda$ which yields a sufficiently small $\lambda\sqrt{n}$. However, $W_0$ itself is not sufficient for the construction of the desired $W$, since we don't have any guarantee that $W_0 \in \mathcal{M}_U^\perp$. In order to achieve this, we'll correct $W_0$ by projecting it onto $\mathcal{M}_U^\perp$. The following lemma suggests that we don't lose much by such a correction.

6.3.3 Correcting the candidate $W_0$

Lemma 9. $W_0$ is as described previously in (104). Let $W^H$ be the projection of $W_0$ on $\mathcal{M}_U^\perp$. Then

• $\|W^H\| \le \|W_0\|$

• For any $\epsilon > 0$, w.p.a.l. $1 - 6n^2\exp(-2\epsilon^2 k_{\min})$ we have

$$\|W_0 - W^H\|_\infty \le 3\lambda\epsilon \tag{108}$$



Proof. Choose arbitrary vectors $\{\mathbf{u}_i\}_{i=t+1}^n$ to make $\{\mathbf{u}_i\}_{i=1}^n$ an orthonormal basis in $\mathbb{R}^n$. Call $U_2 = [\mathbf{u}_{t+1} \ldots \mathbf{u}_n]$ and $P = UU^T$, $P_2 = U_2U_2^T$. Now notice that for any matrix $X \in \mathbb{R}^{n\times n}$, $P_2XP_2$ is in $\mathcal{M}_U^\perp$ since $U^TU_2 = 0$. Let $I$ denote the identity matrix. Then

$$X - P_2XP_2 = X - (I - P)X(I - P) = PX + XP - PXP \in \mathcal{M}_U \tag{109}$$

Hence, $P_2XP_2$ is the orthogonal projection on $\mathcal{M}_U^\perp$. Clearly

$$\|W^H\| = \|P_2W_0P_2\| \le \|P_2\|^2\|W_0\| \le \|W_0\| \tag{110}$$

For the analysis of $\|W_0 - W^H\|_\infty$ we can consider the terms on the right hand side of (109) separately, as we have:

$$\|W_0 - W^H\|_\infty \le \|PW_0\|_\infty + \|W_0P\|_\infty + \|PW_0P\|_\infty \tag{111}$$

Clearly $P = \sum_{i=1}^{t}\frac{1}{k_i}\mathbf{1}^{n\times n}_{\mathcal{R}_i}$. Then, each entry of $\frac{1}{\lambda}PW_0$ is either a summation of $k_i$ i.i.d. Bern$(-p_i, 1 - p_i, 1 - p_i)$ or Bern$(-q, 1 - q, 1 - q)$ random variables scaled by $k_i^{-1}$ for some $i \le t$, or 0. Hence for any $c, d \in [n]$ and $\epsilon > 0$

$$P\big[|(PW_0)_{c,d}| \ge \lambda\epsilon\big] \le 2\exp(-2\epsilon^2 k_{\min}) \tag{112}$$

The same (or better) bounds hold for the entries of $W_0P$ and $PW_0P$. Then a union bound over all entries of the three matrices gives that w.p.a.l. $1 - 6n^2\exp(-2\epsilon^2 k_{\min})$ we have $\|W_0 - W^H\|_\infty \le 3\lambda\epsilon$. ■
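The projection identity (109) can be checked exactly on a small example. The sketch below is illustrative only: it assumes $P = \sum_i \frac{1}{k_i}\mathbf{1}_{\mathcal{R}_i}$ (block averaging over each cluster) and $P_2 = I - P$, uses hypothetical helper names, and takes an arbitrary test matrix in place of $W_0$.

```python
from fractions import Fraction


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def add(A, B, sign=1):
    """Entrywise A + sign * B."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def cluster_projection(n, clusters):
    """P = sum_i (1/k_i) 1_{R_i}, where each cluster is a list of node indices."""
    P = [[Fraction(0)] * n for _ in range(n)]
    for nodes in clusters:
        for i in nodes:
            for j in nodes:
                P[i][j] = Fraction(1, len(nodes))
    return P


n = 4
clusters = [[0, 1], [2]]          # node 3 lies outside every cluster
P = cluster_projection(n, clusters)
I = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
P2 = add(I, P, sign=-1)           # P_2 = I - P
X = [[Fraction(i * n + j) for j in range(n)] for i in range(n)]  # arbitrary test matrix

projected = matmul(matmul(P2, X), P2)            # P_2 X P_2
residual = add(X, projected, sign=-1)            # X - P_2 X P_2
PX, XP = matmul(P, X), matmul(X, P)
rhs = add(add(PX, XP), matmul(PX, P), sign=-1)   # PX + XP - PXP
```

Multiplying by $P$ averages rows (or columns) within each cluster, which is why each entry of $PW_0$ is an average of $k_i$ independent terms and concentrates as in (112).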

6.3.4 Summary of section 6.3

Lemma 9 suggests that $W_0$ can actually be corrected with an arbitrarily small perturbation. This will be useful in the following theorem, which summarizes the main result of this section.

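Theorem 5. $W_0$ and $\epsilon$ are as described previously in (104), (82) respectively. Choose $W$ to be the projection of $W_0$ on $\mathcal{M}_U^\perp$. Also set $\lambda = $ […] and assume $\{k_i\}_{i=1}^t$ is such that $\epsilon > 0$.
Then, w.p.a.l. $1 - 6n^2\exp(-\frac{2}{9}\epsilon^2 k_{\min}) - 4\exp(-$[…]$)$ we have

• $\|W\| < 1$

• For all feasible $E^L$, $f(E^L, W) \ge 0$.

Proof. First consider Lemma 8. Let $\epsilon = $ […]. Then w.p.a.l. $1 - 4\exp(-$[…]$)$ we have

$$\|W\| \le \|W_0\| \le (1 + \epsilon + o(1))\lambda\sqrt{n} < 1 \tag{113}$$

Now, assume we have $\|W_0 - W\|_\infty \le \lambda\epsilon_0$. Then, using (59), for any $E^L$ we can write

$$\langle W_0 - W, E^L\rangle \le \lambda\epsilon_0\big(\mathrm{sum}(E^L_{\mathcal{R}^c}) - \mathrm{sum}(E^L_{\mathcal{R}})\big) \tag{114}$$

$$= \lambda\epsilon_0\big(\mathrm{sum}(E^L) - 2\,\mathrm{sum}(E^L_{\mathcal{R}})\big) \tag{115}$$

Now, we consider (103). As long as $\epsilon_0 \le \min\{1 - q, \epsilon\} = \epsilon$, for any feasible $E^L$ we have

$$f(E^L, W) = f(E^L, W_0) - \langle W_0 - W, E^L\rangle \ge f(E^L, W_0) - \lambda\epsilon_0\big(\mathrm{sum}(E^L) - 2\,\mathrm{sum}(E^L_{\mathcal{R}})\big) \tag{116}$$

$$= \lambda\Big[(1 - q - \epsilon_0)\,\mathrm{sum}(E^L) - \sum_{i=1}^{t}\Big(p_i - q - \frac{1}{\lambda k_i} - 2\epsilon_0\Big)\mathrm{sum}(E^L_{\mathcal{R}_i})\Big] \ge 0 \tag{117}$$

Hence $W$ satisfies the desired condition. Lemma 9 gives the following concentration for $\|W_0 - W\|_\infty$:

$$P\big[\|W_0 - W\|_\infty \ge \lambda\epsilon\big] \le 6n^2\exp\Big(-\frac{2}{9}\epsilon^2 k_{\min}\Big) \tag{118}$$

Finally, a union bound over the failure of the events $\|W\| < 1$ and $\|W_0 - W\|_\infty \le \lambda\epsilon$ gives the result. ■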

Theorem 5 concludes this section, because our aim throughout the section was to construct such a $W$ w.h.p. As a final step, we combine Theorems 4 and 5 and Lemma 3 to deduce the main result for the intelligent approach.

6.4 Final step 

The following argument finishes the proof of Theorem 1 by combining Theorems 4 and 5.

Proof of Theorem 1. For the following discussion, $C_1, C_2, C_3 > 0$ are suitable constants for the previous theorems. Let $\epsilon$ be the same as before. Then $\epsilon$ is at least a constant multiple of $\min_{i \le t}(p_i - q)$, and the statements of Theorems 4 and 5 will hold w.p.a.l. $1 - 2nt\exp(-C_1(p_{\min} - q)^2 k_{\min})$ and $1 - 6n^2\exp(-C_2(p_{\min} - q)^2 k_{\min}) - 4\exp(-C_3 n)$ respectively. Then, using a union bound and $n \ge k_i$, both statements hold w.p.a.l. $1 - (8n^2 + o(1))\exp(-\min\{C_1, C_2, C_3\}(p_{\min} - q)^2 k_{\min})$ and we have

• From Theorem 4, for any nonzero $E^L \in \mathcal{M}_U$, $g(E^L) = f(E^L, 0) > 0$, hence the objective increases.

• Otherwise, due to Theorem 5, there exists a $W \in \mathcal{M}_U^\perp$ with $\|W\| < 1$ such that for any $E^L$, $f(E^L, W) \ge 0$. Then, from Lemma 3, there exists $W^* \in \mathcal{M}_U^\perp$ with $\|W^*\| \le 1$ such that $f(E^L, W^*) > 0$, hence the objective increases.

Then for all $E^L$ the objective increases, which implies that $(L^0, S^0)$ is the unique optimal solution of problem (15). ■



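7 Proof of Theorem 2

We'll follow almost the same approach and notation as in Section 6. We aim to show that $(L^0, S^0)$ given in (13) is the unique optimal solution to problem (10).

7.1 Perturbation analysis

Lemma 10. Let $(E^L, E^S)$ be a feasible perturbation. Then, the objective will increase by at least

$$f(E^L, W) = \sum_{i=1}^{t}\frac{1}{k_i}\mathrm{sum}(E^L_{\mathcal{R}_i}) + \langle E^L, W\rangle + \lambda\big(\mathrm{sum}(E^L_{\mathcal{A}^c}) - \mathrm{sum}(E^L_{\mathcal{A}})\big) \tag{119}$$

for any $W \in \mathcal{M}_U^\perp$, $\|W\| \le 1$.

Proof. Clearly $E^L = -E^S$, as $L^0 + S^0 = A$. Similar to the previous section, for any such $W$ the increase in $\|L\|_*$ satisfies

$$\|L^0 + E^L\|_* - \|L^0\|_* \ge \sum_{i=1}^{t}\frac{1}{k_i}\mathrm{sum}(E^L_{\mathcal{R}_i}) + \langle E^L, W\rangle \tag{120}$$

For the sparse component, using $\mathrm{sign}(S^0) = \mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}}$ and choosing $Q = \mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}$, we find:

$$\|S^0 - E^L\|_1 - \|S^0\|_1 \ge \langle -E^L, \mathrm{sign}(S^0) + Q\rangle = \mathrm{sum}(E^L_{\mathcal{A}^c}) - \mathrm{sum}(E^L_{\mathcal{A}}) \tag{121}$$

Combining these, we get the desired form $f(E^L, W)$. ■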
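Notice that we can directly use Lemma 3. Let

$$g(E^L) = \sum_{i=1}^{t}\frac{1}{k_i}\mathrm{sum}(E^L_{\mathcal{R}_i}) + \lambda\big(\mathrm{sum}(E^L_{\mathcal{A}^c}) - \mathrm{sum}(E^L_{\mathcal{A}})\big) \tag{122}$$

Then, we first show that w.h.p. the objective strictly increases for all $E^L \in \mathcal{M}_U$, and then w.h.p. construct a dual certificate $W$ satisfying $\|W\| < 1$, $W \in \mathcal{M}_U^\perp$ and, for all feasible $E^L$, $f(E^L, W) \ge 0$.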

7.2 Solving for $E^L \in \mathcal{M}_U$ case

Let $g_1(X) = \sum_{i=1}^{t}\frac{1}{k_i}\mathrm{sum}(X_{\mathcal{R}_i})$ and $g_2(X) = \mathrm{sum}(X_{\mathcal{A}^c}) - \mathrm{sum}(X_{\mathcal{A}})$.

7.2.1 Summary of the similarities with the proof of Theorem 1

$E^L$ has the form $VM^T + NV^T$, and $M$, $N$, $\{\mathbf{m}_i\}$, $\{\mathbf{n}_i\}$ and the associated local quantities are as described in Section 6. Again we consider problem (63) and, since $g_1(E^L)$ is fixed, we just need to optimize over $g_2(E^L)$. This optimization can be reduced to the local optimizations (68). Since $L^0 = \mathbf{1}^{n\times n}_{\mathcal{R}}$, (59) applies for $E^L$, and we can make use of Lemma 5 and assume that $\mathbf{m}_l^i$ is nonpositive when $i = l$ and nonnegative when $i \neq l$, for all $i, l$. Hence, using Lemma 6, we lower bound $g(\mathbf{v}_l\mathbf{m}_l^T)$ as follows.

7.2.2 Lower bounding g(E L ) 

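For the purpose of this section, we set $\epsilon$ as follows:

$$\epsilon = \frac{1}{2}\min\Big\{1 - 2q, \Big\{2p_i - \frac{1}{\lambda k_i} - 1\Big\}_{i=1}^{t}\Big\} \tag{123}$$

Lemma 11. Assume $l \le t$, $\epsilon > 0$. Then, w.p.a.l. $1 - n\exp(-2\epsilon^2(k_l - 1))$, we have $g(\mathbf{v}_l\mathbf{m}_l^T) \ge 0$ for all $\mathbf{m}_l$. Also, if $\mathbf{m}_l \neq 0$ then the inequality is strict.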

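Proof. Recall that $\mathbf{m}_l^i$ is nonpositive when $i = l$ and nonnegative when $i \neq l$, for all $i$. Call $X^i = \mathbf{1}_{k_l}\mathbf{m}_l^{i\,T}$. We can write

$$g(\mathbf{v}_l\mathbf{m}_l^T) = \frac{1}{k_l}\mathrm{sum}(X^l) + \sum_{i=1}^{t+1}\lambda\,h(X^i, \beta_{l,i}) \tag{124}$$

where $h(X^i, \beta_{l,i}) = \mathrm{sum}(X^i_{\beta_{l,i}^c}) - \mathrm{sum}(X^i_{\beta_{l,i}})$. Now assume $i \neq l$. Using Lemma 6 with a free parameter $\epsilon' > 0$ and the fact that $\beta_{l,i}$ is a random support with parameter $q$, w.p.a.l. $1 - k_i\exp(-2\epsilon'^2 k_l)$, for all $X^i$ we have

$$h(X^i, \beta_{l,i}) \ge (1 - q - \epsilon')\,\mathrm{sum}(X^i) - (q + \epsilon')\,\mathrm{sum}(X^i) = (1 - 2q - 2\epsilon')\,\mathrm{sum}(X^i) \tag{125}$$

where the inequality is strict if $X^i \neq 0$. Similarly, when $i = l$ we have, w.p.a.l. $1 - k_l\exp(-2\epsilon'^2(k_l - 1))$,

$$\frac{1}{\lambda k_l}\mathrm{sum}(X^l) + h(X^l, \beta_{l,l}) \ge \Big(1 - p_l + \epsilon' + \frac{1}{\lambda k_l}\Big)\mathrm{sum}(X^l) - (p_l - \epsilon')\,\mathrm{sum}(X^l) = -\Big(2p_l - 1 - \frac{1}{\lambda k_l} - 2\epsilon'\Big)\mathrm{sum}(X^l) \tag{126}$$

Choosing $\epsilon' = \epsilon$ and using the facts $1 - 2q - 2\epsilon \ge 0$, $2p_l - 1 - \frac{1}{\lambda k_l} - 2\epsilon \ge 0$ and a union bound, w.p.a.l. $1 - n\exp(-2\epsilon^2(k_l - 1))$ we have $g(\mathbf{v}_l\mathbf{m}_l^T) \ge 0$, and the inequality is strict when $\mathbf{m}_l \neq 0$, as at least one of the $X^i$'s will be nonzero. ■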

The following theorem immediately follows from Lemma 11 and summarizes the main result of the section.

Theorem 6. Let $\epsilon$ be as in (123) and assume $\epsilon > 0$. Then w.p.a.l. $1 - 2nt\exp(-2\epsilon^2(k_{\min} - 1))$ we have $g(E^L) > 0$ for all nonzero feasible $E^L \in \mathcal{M}_U$.






7.3 Showing existence of the dual certificate 

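Again, we'll follow quite similar steps to Section 6.3. Recall that

$$f(E^L, W) = \sum_{i=1}^{t}\frac{1}{k_i}\mathrm{sum}(E^L_{\mathcal{R}_i}) + \langle E^L, W\rangle + \lambda\big(\mathrm{sum}(E^L_{\mathcal{A}^c}) - \mathrm{sum}(E^L_{\mathcal{A}})\big) \tag{127}$$

$W$ will be constructed from the candidate $W_0$ as follows.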

7.3.1 Candidate W 

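Based on convex program (10), we propose the following form

$$W_0 = \sum_{i=1}^{t} c_i\,\mathbf{1}^{n\times n}_{\mathcal{R}_i} + c\,\mathbf{1}^{n\times n}_{\mathcal{R}^c} + \lambda\big(\mathbf{1}^{n\times n}_{\mathcal{A}} - \mathbf{1}^{n\times n}_{\mathcal{A}^c}\big) \tag{128}$$

where $\{c_i\}_{i=1}^t$, $c$ are variables. In this case, we'll have $f(E^L, W_0) = \sum_{i=1}^{t}(c_i + \frac{1}{k_i})\,\mathrm{sum}(E^L_{\mathcal{R}_i}) + c\,\mathrm{sum}(E^L_{\mathcal{R}^c})$, and when $c_i + \frac{1}{k_i} \le 0$ and $c \ge 0$, using (59) we'll have $f(E^L, W_0) \ge 0$ for all $E^L$ as desired. $W_0$ is a random matrix where the randomness is due to $\mathcal{A}$, and in order to ensure a small spectral norm we set its expectation to 0. The expectation of an entry of $W_0$ on $\mathcal{R}_i$ and $\mathcal{R}^c$ is $c_i + \lambda(2p_i - 1)$ and $c + \lambda(2q - 1)$ respectively. Hence

$$c_i = -\lambda(2p_i - 1) \ \text{ and } \ c = -\lambda(2q - 1) \tag{129}$$

and $f$ and $W_0$ take the following forms

$$f(E^L, W_0) = \lambda\Big[(1 - 2q)\,\mathrm{sum}(E^L_{\mathcal{R}^c}) - \sum_{i=1}^{t}\Big(2p_i - 1 - \frac{1}{\lambda k_i}\Big)\mathrm{sum}(E^L_{\mathcal{R}_i})\Big] \tag{130}$$

$$W_0 = 2\lambda\Big[\sum_{i=1}^{t}\big((1 - p_i)\,\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}_i} - p_i\,\mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}_i}\big) + (1 - q)\,\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - q\,\mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}\Big] \tag{131}$$

Hence we require $\lambda(2p_i - 1) \ge \frac{1}{k_i}$ and $1 \ge 2q$. Notice that $W_0$ has the same form as (104), analyzed previously, up to the factor of 2. Consequently, Lemma 8 directly applies and $\|W_0\|$ is bounded above by $2(1 + \epsilon + o(1))\lambda\sqrt{n}$ w.h.p.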
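The entry values of this candidate can be checked with exact arithmetic. The sketch below assumes the form $W_0 = \sum_i c_i\mathbf{1}_{\mathcal{R}_i} + c\,\mathbf{1}_{\mathcal{R}^c} + \lambda(\mathbf{1}_{\mathcal{A}} - \mathbf{1}_{\mathcal{A}^c})$ with $c_i = -\lambda(2p_i - 1)$ and $c = -\lambda(2q - 1)$; the function names and sample values are illustrative.

```python
from fractions import Fraction


def entry_values_inside(lam, pi):
    """Values of W_0 on R_i with c_i = -lam*(2*p_i - 1): (on an edge, on a missing edge)."""
    ci = -lam * (2 * pi - 1)
    return ci + lam, ci - lam


def entry_values_outside(lam, q):
    """Values of W_0 on R^c with c = -lam*(2*q - 1): (on an edge, on a missing edge)."""
    c = -lam * (2 * q - 1)
    return c + lam, c - lam


lam, pi, q = Fraction(1, 10), Fraction(4, 5), Fraction(1, 4)
on_edge, off_edge = entry_values_inside(lam, pi)
out_edge, out_missing = entry_values_outside(lam, q)
```

The values come out as $2\lambda(1 - p_i)$ and $-2\lambda p_i$ inside a cluster, and $2\lambda(1 - q)$ and $-2\lambda q$ outside, so each entry has zero mean and the matrix is exactly twice the form treated by Lemma 8.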

7.3.2 Summary of section 7.3

Luckily, Lemma 9 also directly applies, as the form of $W_0$ is exactly the same as in Section 6. As a result, we can state the following theorem.

Theorem 7. $W_0$ is as described previously in (131). Choose $W$ to be the projection of $W_0$ on $\mathcal{M}_U^\perp$. Also set $\lambda = $ […], let $\epsilon$ be the same as in Theorem 6 and assume $\{k_i\}$ is such that $\epsilon > 0$.
Then, w.p.a.l. $1 - 6n^2\exp(-\frac{2}{9}\epsilon^2 k_{\min}) - 4\exp(-$[…]$)$ we have

• $\|W\| < 1$

• For all feasible $E^L$, $f(E^L, W) \ge 0$.

Proof. Exactly as in the proof of Theorem 5, w.p.a.l. $1 - 4\exp(-$[…]$)$ we have $\|W\| < 1$. Secondly, from Lemma 9, w.p.a.l. $1 - 6n^2\exp(-\frac{2}{9}\epsilon^2 k_{\min})$ we have $\|W_0 - W\|_\infty \le 2\lambda\epsilon$. Then, based on (59), for all $E^L$

$$f(E^L, W) = f(E^L, W_0) - \langle W_0 - W, E^L\rangle \ge f(E^L, W_0) - 2\lambda\epsilon\big(\mathrm{sum}(E^L_{\mathcal{R}^c}) - \mathrm{sum}(E^L_{\mathcal{R}})\big) \tag{132}$$

$$= \lambda\Big[(1 - 2q - 2\epsilon)\,\mathrm{sum}(E^L_{\mathcal{R}^c}) - \sum_{i=1}^{t}\Big(2p_i - 1 - \frac{1}{\lambda k_i} - 2\epsilon\Big)\mathrm{sum}(E^L_{\mathcal{R}_i})\Big] \ge 0 \tag{133}$$

Hence, by a union bound, $W$ satisfies both of the desired conditions. ■

7.4 Final Step

Proof of Theorem 2. Notice that the choice of $\lambda$ and the lower bound on the $k_i$ imply

$$2\epsilon = \min\Big\{1 - 2q, \Big\{2p_i - 1 - \frac{1}{\lambda k_i}\Big\}_{i=1}^{t}\Big\} \ge \min\Big\{1 - 2q,\ p_{\min} - \frac{1}{2}\Big\} \tag{134}$$

Then, based on Theorems 6 and 7, w.p.a.l. $1 - cn^2\exp(-C(\min\{1 - 2q, 2p_{\min} - 1\})^2 k_{\min})$

• For all nonzero $E^L \in \mathcal{M}_U$ we have $g(E^L) > 0$.

• There exists $W \in \mathcal{M}_U^\perp$ with $\|W\| < 1$ s.t. for all $E^L$, $f(E^L, W) \ge 0$.

Consequently, based on Lemma 3, $(L^0, S^0)$ is the unique optimal solution of problem (10). ■

8 Proof of Theorem [3] 

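Proof of Theorem 3. For the proof, we'll construct a feasible $(L^1, S^1)$ which yields a lower objective value w.h.p. Consider the first case, where $\frac{1}{2} > p_{\min}$. WLOG assume $\{p_i\}$ is ordered decreasingly and $p_c \ge \frac{1}{2} > p_{c+1}$ for some $c \le t - 1$. Then, let $L^1 = \sum_{i=1}^{c}\mathbf{1}^{n\times n}_{\mathcal{R}_i}$ and $S^1 = A - L^1$. Then the difference between the objectives is given by

$$\|L^0\|_* - \|L^1\|_* + \lambda\big(\|S^0\|_1 - \|S^1\|_1\big) = \sum_{i=c+1}^{t} k_i - \lambda\,\mathrm{sum}\big(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\Gamma} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\Gamma}\big) \tag{135}$$

where $\Gamma = \bigcup_{i>c}\mathcal{R}_i$. $\mathrm{sum}(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\Gamma} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\Gamma})$ is simply a summation of $|\Gamma|$ independent Bern$(1, -1, p_i)$ random variables (for some $i > c$). Hence, the means are nonpositive as $p_i < \frac{1}{2}$, and we'll argue that w.h.p., for all $i > c$ with $k_i \neq 0$,

$$h(\mathcal{C}_i, A) = k_i - \lambda\,\mathrm{sum}\big(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}_i} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}_i}\big) > 0 \tag{136}$$

to conclude. There are $k_i^2$ such random variables in $\mathcal{R}_i$, hence a Chernoff bound will give $P[h(\mathcal{C}_i, A) > 0] \ge 1 - c_1\exp(-c_2 n)$ for appropriate constants $c_1, c_2 > 0$, for any $k_i \neq 0$. The reason is that a failure requires a deviation of at least $k_i/\lambda$ from the mean, which is of order $k_i\sqrt{n}$. By using a union bound over the events (136), we obtain that (135) is positive w.h.p.

If $q > \frac{1}{2}$, let $L^1 = \mathbf{1}^{n\times n}$ and $S^1 = -\mathbf{1}^{n\times n}_{\mathcal{A}^c}$. Then

$$\|L^0\|_* - \|L^1\|_* + \lambda\big(\|S^0\|_1 - \|S^1\|_1\big) = \sum_{i=1}^{t} k_i - n + \lambda\,\mathrm{sum}\big(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}\big) \tag{137}$$

Note that $n - \sum_{i=1}^{t} k_i = |\mathcal{C}_{t+1}|$, where $\mathcal{C}_{t+1}$ was the set of nodes outside of the clusters. Then, we just need to show that $\mathrm{sum}(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}) > |\mathcal{C}_{t+1}|/\lambda$ to conclude that $(L^1, S^1)$ is strictly better. Similar to the previous case, $\mathrm{sum}(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c})$ is a sum of $|\mathcal{R}^c|$ Bern$(1, -1, q)$ random variables.

If $\mathcal{C}_{t+1} \neq \emptyset$: Clearly $|\mathcal{R}^c| \ge |\mathcal{C}_{t+1}|\,n$. Consequently,

$$E\big[\mathrm{sum}\big(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}\big)\big] = |\mathcal{R}^c|(2q - 1) \ge |\mathcal{C}_{t+1}|\,n(2q - 1) \tag{138}$$

and, due to Chernoff bounding, it is highly concentrated around the mean. As $n \to \infty$ we have $|\mathcal{C}_{t+1}|\,n(2q - 1) \gg |\mathcal{C}_{t+1}|/\lambda$, which is of order $|\mathcal{C}_{t+1}|\sqrt{n}$; hence, w.h.p. (137) is positive. The error exponent is $\Omega(|\mathcal{C}_{t+1}|\,n)$.

On the other hand, if $\mathcal{C}_{t+1} = \emptyset$ but $|\mathcal{R}^c| \neq 0$, then $t \ge 2$ and we have $|\mathcal{R}^c| \ge 2(n - 1)$, as for any positive integers $a, b$ with $a + b = n$

$$(a + b)^2 - a^2 - b^2 = 2ab \ge 2(a + b - 1) = 2(n - 1) \tag{139}$$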
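The elementary inequality in (139) is equivalent to $(a - 1)(b - 1) \ge 0$ and can be confirmed by brute force. The following is a small sketch with hypothetical function names; it assumes $a$ and $b$ are positive integers, which is the case for cluster sizes.

```python
def cross_block_count(a, b):
    """Number of entries outside the two diagonal blocks: (a+b)^2 - a^2 - b^2 = 2ab."""
    return (a + b) ** 2 - a ** 2 - b ** 2


def inequality_139_holds(n):
    """Check 2ab >= 2(n - 1) for every split n = a + b into positive integers."""
    return all(cross_block_count(a, n - a) >= 2 * (n - 1) for a in range(1, n))


# The bound is tight when one of the two blocks has size 1.
tight_case = cross_block_count(1, 5)
```

Equality occurs exactly when one of the two parts has size 1, so $2(n - 1)$ is the smallest possible number of off-block entries.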

In this case, we only require $\mathrm{sum}(\mathbf{1}^{n\times n}_{\mathcal{A}\cap\mathcal{R}^c} - \mathbf{1}^{n\times n}_{\mathcal{A}^c\cap\mathcal{R}^c}) > 0$. Again, this will happen w.h.p. since $2q - 1 > 0$. The error exponent is $|\mathcal{R}^c|$, which is $\Omega(n)$. ■